Remember and Forget for Experience Replay (ReF-ER)

1 Overview

Remember and Forget for Experience Replay (ReF-ER) dismisses transitions that are “far-policy”. The key metric, the importance weight \(\rho_t\), is the ratio between the probability of selecting \(a_t\) under the current policy \(\pi^w\) and under the behavior policy \(\mu_t\): \(\rho_t = \pi^w(a_t \mid s_t) / \mu_t(a_t \mid s_t)\).

If \(1/c_{\text{max}} < \rho_t < c_{\text{max}}\), the transition is classified as “near-policy”; otherwise it is “far-policy”. The gradient contributions \(\hat{g}(w)\) computed from far-policy samples are clipped to 0.
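For concreteness, here is a minimal NumPy sketch of this classification and clipping step (the function name, the \(c_{\text{max}}\) value, and the toy numbers are assumptions for illustration, not the authors' code):

```python
import numpy as np

# Hypothetical helper: classify transitions as near- or far-policy and
# zero out the gradient contributions of far-policy samples.
def near_policy_mask(pi_prob, mu_prob, c_max):
    """Return a boolean mask that is True for near-policy samples."""
    rho = pi_prob / mu_prob                      # importance weight rho_t
    return (rho > 1.0 / c_max) & (rho < c_max)   # 1/c_max < rho_t < c_max

# Toy example: per-sample gradients are zeroed for far-policy samples.
pi_prob = np.array([0.20, 0.05, 0.30])   # pi^w(a_t | s_t)
mu_prob = np.array([0.25, 0.50, 0.28])   # mu_t(a_t | s_t)
per_sample_grad = np.array([0.7, -1.2, 0.4])

mask = near_policy_mask(pi_prob, mu_prob, c_max=4.0)
clipped_grad = np.where(mask, per_sample_grad, 0.0)  # far-policy -> 0
```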

Additionally, a penalty term \(\hat{g}^D(w)\) is defined as follows: \(\hat{g}^D(w) = E[\nabla D_{\text{KL}}(\mu_k(\cdot \mid s_k) \| \pi^w(\cdot \mid s_k))]\).
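As a sketch, for diagonal Gaussian policies (an assumption made here only for illustration; the policy class and the names below are not taken from the paper) the KL divergence has a closed form that can then be differentiated with respect to \(w\):

```python
import numpy as np

# Closed-form D_KL( mu(.|s) || pi^w(.|s) ) for diagonal Gaussian policies,
# summed over action dimensions (illustrative sketch only).
def kl_diag_gaussians(mu_mean, mu_std, pi_mean, pi_std):
    var_ratio = (mu_std / pi_std) ** 2
    return np.sum(np.log(pi_std / mu_std)
                  + 0.5 * (var_ratio + ((mu_mean - pi_mean) / pi_std) ** 2 - 1.0))

# Example with a 2-dimensional action space.
kl = kl_diag_gaussians(mu_mean=np.array([0.1, -0.3]), mu_std=np.array([0.5, 0.4]),
                       pi_mean=np.array([0.0, -0.2]), pi_std=np.array([0.6, 0.4]))
# In practice, g^D(w) is obtained by differentiating this expression with
# respect to the parameters w of pi^w (e.g. via an autodiff framework) and
# averaging over the sampled transitions.
```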

These two terms are combined with an annealing parameter \(\beta\):

\[ \hat{g}^{\text{ReF-ER}}(w) = \beta \hat{g}(w) + (1-\beta) \hat{g}^D(w) \]

\(\beta\) is updated at each step by the following rule:

\[ \beta \leftarrow \begin{cases} (1-\eta)\beta & \text{if } n_{\text{far}}/N > D \\ (1-\eta)\beta + \eta & \text{otherwise} \end{cases} \]

where \(\eta\) is the learning rate of the neural network, \(N\) is the total number of samples in the replay buffer, \(n_{\text{far}}\) is the number of far-policy samples in the replay buffer, and \(D\) is a hyperparameter.
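A minimal sketch of the annealing step and the combined gradient follows; all names and values are illustrative assumptions, not the authors' code:

```python
import numpy as np

def update_beta(beta, eta, n_far, N, D):
    """One ReF-ER annealing step for beta."""
    if n_far / N > D:                  # too many far-policy samples in the buffer
        return (1.0 - eta) * beta      # shrink beta, emphasizing the KL penalty
    return (1.0 - eta) * beta + eta    # otherwise push beta back toward 1

beta = update_beta(beta=0.9, eta=1e-4, n_far=1_200, N=10_000, D=0.1)

g_hat = np.array([0.7, 0.0, 0.4])      # clipped policy gradient (far-policy terms zeroed)
g_D_hat = np.array([0.1, -0.2, 0.3])   # gradient of the KL penalty
g_refer = beta * g_hat + (1.0 - beta) * g_D_hat
```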

2 With cpprb

Under investigation. It is still not clear how the authors sample from the replay buffer. We continue to investigate their code.
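Although the exact sampling scheme is still unclear, one plausible setup with cpprb is to store the behavior-policy probability \(\mu_t(a_t \mid s_t)\) alongside each transition, so that \(\rho_t\) can be recomputed against the current policy when a batch is drawn. The key name `mu_prob` and the shapes below are assumptions chosen for illustration, not the authors' implementation:

```python
import numpy as np
from cpprb import ReplayBuffer

# Store mu_t(a_t|s_t) together with each transition so that rho_t can be
# recomputed later against the current policy. "mu_prob" is a user-defined
# key chosen here for illustration.
buffer_size = 10_000
rb = ReplayBuffer(buffer_size,
                  env_dict={"obs": {"shape": 4},
                            "act": {"shape": 1},
                            "rew": {},
                            "next_obs": {"shape": 4},
                            "done": {},
                            "mu_prob": {}})

rb.add(obs=np.zeros(4), act=np.zeros(1), rew=0.0,
       next_obs=np.zeros(4), done=0.0, mu_prob=0.3)

sample = rb.sample(32)   # dict of NumPy arrays, including sample["mu_prob"]
# rho = current_policy_prob(sample["obs"], sample["act"]) / sample["mu_prob"]
```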

3 References